fix: correct CPU usage graph pinned at 100% (#14)
Conversation
The CPU graph was pinned at 100% because `host_cpu_seconds_total` from Vector is a per-core, per-mode counter. Summing all modes (including idle) across all cores meant the delta always exceeded wall-clock time, so `(delta/dt)*100` was always >100% and got clamped. Fix: track idle CPU seconds separately and compute utilization as `(total - idle) / total * 100`, which is core-count independent and gives accurate whole-server CPU utilization.

Changes across the full stack:

- Agent scraper: filter by `mode` label, sum `idle`+`iowait` separately
- Agent structs/heartbeat: add `CpuSecondsIdle` field
- Server heartbeat route: accept and store `cpuSecondsIdle`
- Prisma schema + migration: add `cpuSecondsIdle` column
- Fleet router: return new field
- Frontend chart: new formula using idle delta
Greptile Summary

This PR fixes a CPU usage graph that was permanently pinned at 100% by introducing a separate idle-time counter, so utilization can be computed as `(total - idle) / total` instead of dividing the counter delta by wall-clock time.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant V as Vector (Prometheus)
participant S as scraper.go
participant H as heartbeat.go
participant API as /api/agent/heartbeat
participant DB as PostgreSQL (NodeMetric)
participant FE as node-metrics-charts.tsx
V->>S: host_cpu_seconds_total{mode="idle"} += x
V->>S: host_cpu_seconds_total{mode="iowait"} += y
V->>S: host_cpu_seconds_total{mode="user"} += z
note over S: CpuSecondsTotal += all modes<br/>CpuSecondsIdle += idle + iowait
S->>H: HostMetrics{CpuSecondsTotal, CpuSecondsIdle}
H->>API: POST heartbeat {cpuSecondsTotal, cpuSecondsIdle}
API->>API: Zod validate (both optional)
API->>DB: NodeMetric.create({cpuSecondsTotal, cpuSecondsIdle})
FE->>DB: fleet.nodeMetrics query (cpuSecondsTotal, cpuSecondsIdle selected)
DB-->>FE: time-series rows
note over FE: cpuPercent = (totalDelta - idleDelta) / totalDelta * 100<br/>clamped to [0, 100]
Last reviewed commit: 1663a49
	MemoryUsedBytes: sr.Host.MemoryUsedBytes,
	MemoryFreeBytes: sr.Host.MemoryFreeBytes,
	CpuSecondsTotal: sr.Host.CpuSecondsTotal,
CpuSecondsIdle:  sr.Host.CpuSecondsIdle,
Misaligned indentation on new field
The CpuSecondsIdle line uses one fewer tab than every other field in the same struct literal. Running gofmt would flag this. While it compiles fine, it breaks visual alignment and will cause noisy diffs in future edits.
Prompt To Fix With AI
This is a comment left during a code review.
Path: agent/internal/agent/heartbeat.go
Line: 63
Comment:
**Misaligned indentation on new field**
The `CpuSecondsIdle` line uses one fewer tab than every other field in the same struct literal. Running `gofmt` would flag this. While it compiles fine, it breaks visual alignment and will cause noisy diffs in future edits.
```suggestion
CpuSecondsIdle: sr.Host.CpuSecondsIdle,
```
How can I resolve this? If you propose a fix, please make it concise.

	if mode == "idle" || mode == "iowait" {
		sr.Host.CpuSecondsIdle += value
	}
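For context, the mode filtering above operates on Prometheus text-exposition lines scraped from Vector. The following is a simplified, self-contained sketch of that filtering step; `parseModeLine` and its parsing logic are illustrative assumptions, not the agent's real scraper:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseModeLine extracts the mode label and value from one line of
// Prometheus text exposition, e.g.
//   host_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
// This is a deliberately simplified parser for illustration only.
func parseModeLine(line string) (mode string, value float64, ok bool) {
	if !strings.HasPrefix(line, "host_cpu_seconds_total{") {
		return "", 0, false
	}
	i := strings.Index(line, `mode="`)
	if i < 0 {
		return "", 0, false
	}
	rest := line[i+len(`mode="`):]
	j := strings.IndexByte(rest, '"')
	if j < 0 {
		return "", 0, false
	}
	mode = rest[:j]
	// the sample value is the last whitespace-separated field
	fields := strings.Fields(line)
	value, err := strconv.ParseFloat(fields[len(fields)-1], 64)
	if err != nil {
		return "", 0, false
	}
	return mode, value, true
}

func main() {
	lines := []string{
		`host_cpu_seconds_total{cpu="0",mode="idle"} 900`,
		`host_cpu_seconds_total{cpu="0",mode="iowait"} 50`,
		`host_cpu_seconds_total{cpu="0",mode="user"} 40`,
		`host_cpu_seconds_total{cpu="0",mode="system"} 10`,
	}
	var total, idle float64
	for _, l := range lines {
		mode, v, ok := parseModeLine(l)
		if !ok {
			continue
		}
		total += v // every mode feeds CpuSecondsTotal
		if mode == "idle" || mode == "iowait" {
			idle += v // idle + iowait feed CpuSecondsIdle
		}
	}
	fmt.Println(total, idle) // prints: 1000 950
}
```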
Consider excluding iowait from the idle bucket
Classifying iowait as "idle" means I/O-bound workloads will show artificially low CPU utilization on the graph. For example, a system that's 80% blocked on disk reads will report ~20% CPU busy even though it's clearly under stress.
Standard tools like iostat and htop report iowait as a separate category precisely to make I/O pressure visible. The field is also named CpuSecondsIdle, which implies pure idle time.
If the intent is "CPU not doing compute work", renaming the field to CpuSecondsNonBusy (and documenting that it includes iowait) would at least make the semantics explicit. Alternatively, tracking idle only and displaying iowait as a separate series in the chart gives users richer diagnostic information.
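The reviewer's second alternative could look roughly like the sketch below: keep `iowait` out of the idle bucket and accumulate it into its own counter. The `CpuSecondsIowait` field and `accumulate` helper are hypothetical names for illustration, not part of the PR:

```go
package main

import "fmt"

// hostMetrics separates pure idle time from iowait so I/O pressure
// stays visible as its own series (field names are illustrative).
type hostMetrics struct {
	CpuSecondsTotal  float64
	CpuSecondsIdle   float64 // pure idle only
	CpuSecondsIowait float64 // tracked separately
}

// accumulate routes one per-mode counter value into the right bucket.
func accumulate(m *hostMetrics, mode string, v float64) {
	m.CpuSecondsTotal += v
	switch mode {
	case "idle":
		m.CpuSecondsIdle += v
	case "iowait":
		m.CpuSecondsIowait += v
	}
}

func main() {
	var m hostMetrics
	accumulate(&m, "idle", 200)
	accumulate(&m, "iowait", 800) // heavily I/O-bound interval
	accumulate(&m, "user", 100)

	// the chart could then render busy and iowait as separate series
	busy := (m.CpuSecondsTotal - m.CpuSecondsIdle - m.CpuSecondsIowait) / m.CpuSecondsTotal * 100
	wait := m.CpuSecondsIowait / m.CpuSecondsTotal * 100
	fmt.Printf("busy=%.1f%% iowait=%.1f%%\n", busy, wait)
}
```

With the PR's current classification, the same interval would render as roughly 9% utilization with no hint that the machine is ~73% blocked on I/O, which is the reviewer's concern.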
Prompt To Fix With AI
This is a comment left during a code review.
Path: agent/internal/metrics/scraper.go
Line: 150-152
Comment:
**Consider excluding `iowait` from the idle bucket**
Classifying `iowait` as "idle" means I/O-bound workloads will show artificially low CPU utilization on the graph. For example, a system that's 80% blocked on disk reads will report ~20% CPU busy even though it's clearly under stress.
Standard tools like `iostat` and `htop` report `iowait` as a separate category precisely to make I/O pressure visible. The field is also named `CpuSecondsIdle`, which implies pure idle time.
If the intent is "CPU not doing compute work", renaming the field to `CpuSecondsNonBusy` (and documenting that it includes iowait) would at least make the semantics explicit. Alternatively, tracking `idle` only and displaying `iowait` as a separate series in the chart gives users richer diagnostic information.
How can I resolve this? If you propose a fix, please make it concise.
Summary
- `host_cpu_seconds_total` is a per-core, per-mode counter: summing all modes (including idle) across all cores meant the delta always exceeded wall-clock time
- New formula: `CPU% = (total - idle) / total * 100`

Changes
Full-stack fix across 7 files:
- Agent scraper: filter by `mode` label from `host_cpu_seconds_total`, sum `idle`+`iowait` into new `CpuSecondsIdle` field
- Agent: add `CpuSecondsIdle` to `HostMetrics` struct and heartbeat builder
- Server: accept `cpuSecondsIdle` in heartbeat Zod schema, store in DB
- Prisma: add `cpuSecondsIdle Float @default(0)` column to `NodeMetric` + migration
- Frontend chart: replace `(cpuDelta / dtSeconds) * 100` with `((totalDelta - idleDelta) / totalDelta) * 100`

Test plan
- Existing rows have `cpuSecondsIdle` default to 0 (CPU shows 100% until agent updates, same as before)
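That backfill behavior falls directly out of the formula: for pre-migration rows the idle delta is 0, so utilization reads 100% until fresh agent data arrives. A minimal sketch (function name illustrative, not the frontend's actual code):

```go
package main

import "fmt"

// cpuPercent applies the PR's new formula to a pair of counter deltas.
// When idleDelta is 0 (as it is between two pre-migration rows, where
// cpuSecondsIdle defaults to 0), the result degenerates to 100%.
func cpuPercent(totalDelta, idleDelta float64) float64 {
	if totalDelta <= 0 {
		return 0
	}
	return (totalDelta - idleDelta) / totalDelta * 100
}

func main() {
	fmt.Println(cpuPercent(80, 0))  // pre-migration rows: prints 100
	fmt.Println(cpuPercent(80, 60)) // fresh agent data: prints 25
}
```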